Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
Here we propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e., linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods.
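The joint disambiguation step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a fully connected pairwise model over mentions, a sum-product message schedule, and hypothetical inputs `unary` (per-mention local candidate scores) and `pairwise` (an entity co-occurrence compatibility function), none of which are specified in the abstract.

```python
from math import prod


def loopy_bp(unary, pairwise, iterations=10):
    """Pick one candidate entity per mention with loopy belief propagation.

    unary:    list of dicts, one per mention, mapping candidate -> local score (> 0).
    pairwise: function (entity_a, entity_b) -> compatibility score (> 0).
    Both inputs are illustrative assumptions, not taken from the paper.
    """
    n = len(unary)
    # msgs[(i, j)][e] is the message from mention i to mention j about candidate e of j.
    msgs = {(i, j): {e: 1.0 for e in unary[j]}
            for i in range(n) for j in range(n) if i != j}

    for _ in range(iterations):
        new_msgs = {}
        for (i, j) in msgs:
            out = {}
            for ej in unary[j]:
                total = 0.0
                for ei, local in unary[i].items():
                    # Product of messages into i from every mention except j.
                    incoming = prod(msgs[(k, i)][ei]
                                    for k in range(n) if k not in (i, j))
                    total += local * pairwise(ei, ej) * incoming
                out[ej] = total
            z = sum(out.values()) or 1.0  # normalise to avoid underflow
            new_msgs[(i, j)] = {e: v / z for e, v in out.items()}
        msgs = new_msgs

    # Belief of each candidate = local score times all incoming messages.
    result = []
    for i in range(n):
        belief = {e: local * prod(msgs[(k, i)][e] for k in range(n) if k != i)
                  for e, local in unary[i].items()}
        result.append(max(belief, key=belief.get))
    return result
```

With only a handful of candidates per mention, each iteration touches every ordered mention pair once, which is consistent with the abstract's point that inference in a low-complexity model can be fast.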
Fifty years of spellchecking
A short history of spellchecking from the late 1950s to the present day, describing its development through dictionary lookup, affix stripping, correction, confusion sets, and edit distance to the use of gigantic databases.
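As an illustration of one technique named above, here is a minimal sketch of edit distance (the standard Levenshtein variant) used to rank correction candidates. The helper `suggest`, its `max_distance` parameter, and the tiny dictionary in the usage note are hypothetical and not drawn from the article.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]
```

For example, `suggest("speling", ["spelling", "spewing", "apple"])` keeps the two words that are one edit away and drops "apple".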
Enriching Query Flow Graphs with Click Information
The increased availability of large amounts of data about user search behaviour in search engines has triggered a lot of research in recent years. This includes developing machine learning methods to build knowledge structures that could be exploited for a number of tasks such as query recommendation. Query flow graphs are a successful example of these structures; they are generated from the sequence of queries typed in by a user in a search session. In this paper we propose to modify the query flow graph by incorporating clickthrough information from the search logs. Click information provides evidence of the success or failure of the search journey and therefore can be used to enrich the query flow graph to make it more accurate and useful for query recommendation. We propose a method of adjusting the weights on the edges of the query flow graph by incorporating the number of clicked documents after submitting a query. We explore a number of weighting functions for the graph edges using click information. Applying an automated evaluation framework to assess query recommendations allows us to perform automatic and reproducible evaluation experiments. We demonstrate how our modified query flow graph outperforms the standard query flow graph. The experiments are conducted on the search logs of an academic organisation's search engine and validated in a second experiment on the log files of another Web site. © 2011 Springer-Verlag Berlin Heidelberg
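The idea of click-adjusted edge weights can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's method: the session format (lists of `(query, clicked_document_count)` pairs), the function names, and the example weighting `1 + clicks` are all hypothetical, since the abstract does not give the weighting functions that were explored.

```python
from collections import defaultdict


def build_query_flow_graph(sessions, click_weight=lambda clicks: 1.0):
    """Build a weighted query flow graph from search sessions.

    sessions:     list of sessions, each a list of (query, n_clicked_docs) pairs.
    click_weight: maps the click count of the follow-up query to the weight
                  added to the edge; the default ignores clicks, which gives
                  the plain transition-count graph.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        # Each consecutive pair of queries in a session is one transition.
        for (q1, _), (q2, clicks) in zip(session, session[1:]):
            if q1 != q2:
                graph[q1][q2] += click_weight(clicks)
    return graph


def recommend(graph, query, k=3):
    """Return up to k follow-up queries, heaviest edge first (ties by name)."""
    followers = graph.get(query, {})
    return sorted(followers, key=lambda q: (-followers[q], q))[:k]
```

Passing a different `click_weight` changes only how much each observed transition contributes, so the standard and click-enriched graphs can be compared on the same logs.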
Robust and Collective Entity Disambiguation through Semantic Embeddings
Entity disambiguation is the task of mapping ambiguous terms in natural-language text to their entities in a knowledge base. It finds its application in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally so in facilitating artificial intelligence applications, such as Semantic Search, Reasoning and Question & Answering. We propose a new collective, graph-based disambiguation algorithm utilizing semantic entity and document embeddings for robust entity disambiguation. Robust here refers to the property of achieving better than state-of-the-art results over a wide range of very different data sets. Our approach is also able to abstain if no appropriate entity can be found for a specific surface form. Our evaluation shows that our approach achieves significantly (>5%) better results than all other publicly available disambiguation algorithms on 7 of 9 datasets without dataset-specific tuning. Moreover, we discuss the influence of the quality of the knowledge base on the disambiguation accuracy and indicate that our algorithm achieves better results than non-publicly available state-of-the-art algorithms.
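The abstention behaviour mentioned above can be illustrated with a deliberately simplified sketch. It scores each candidate independently by cosine similarity between a document embedding and an entity embedding, and abstains below a threshold. The collective, graph-based part of the algorithm is not reproduced, and the names `disambiguate`, `doc_vec`, `candidates`, and `threshold` are assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(doc_vec, candidates, threshold=0.5):
    """Return the candidate entity whose embedding is most similar to the
    document embedding, or None (abstain) if no score exceeds threshold.

    candidates: dict mapping entity id -> embedding vector (hypothetical format).
    """
    best, best_score = None, threshold
    for entity, vec in candidates.items():
        score = cosine(doc_vec, vec)
        if score > best_score:
            best, best_score = entity, score
    return best
```

Returning `None` rather than the least-bad candidate is what lets a caller tell "no appropriate entity" apart from a low-confidence link.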